Ensemble Performance Through the Lens of Linear Independence of Classifier Votes in Data Streams
Ensemble learning improves classification performance by combining multiple base classifiers. While increasing the number of classifiers generally enhances accuracy, excessively large ensembles can lead to computational inefficiency and diminishing returns. This paper investigates the relationship between ensemble size and performance through the lens of linear independence among classifier votes in data streams. We propose that ensembles composed of linearly independent classifiers maximize representational capacity, particularly under a geometric model. We then generalize the importance of linear independence to the weighted majority voting problem. By modeling the probability of achieving linear independence among classifier outputs, we derive a theoretical framework that explains the trade-off between ensemble size and accuracy. Our analysis leads to a theoretical estimate of the ensemble size required to achieve a user-specified probability of linear independence. We validate our theory through experiments on both real-world and synthetic datasets using two ensemble methods, OzaBagging and GOOWE. Our results confirm that this theoretical estimate effectively identifies the point of performance saturation for robust ensembles like OzaBagging. Conversely, for complex weighting schemes like GOOWE, our framework reveals that high theoretical diversity can trigger algorithmic instability. Our implementation is publicly available to support reproducibility and future research.
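The paper's rank-based view of vote independence can be illustrated with a small numerical sketch (shapes and names are hypothetical, not the authors' implementation): stack each base classifier's class-score vector as a row of a matrix and test whether the matrix rank equals the ensemble size.

```python
import numpy as np

# Sketch: classifier vote vectors are linearly independent exactly when the
# matrix stacking them row-wise has full row rank. Shapes are illustrative.

def votes_linearly_independent(vote_matrix: np.ndarray) -> bool:
    """True if the rows (one class-score vector per classifier) are linearly independent."""
    return np.linalg.matrix_rank(vote_matrix) == vote_matrix.shape[0]

votes = np.array([
    [0.9, 0.1, 0.0],  # classifier 1's scores over 3 classes
    [0.2, 0.7, 0.1],  # classifier 2
    [0.1, 0.2, 0.7],  # classifier 3
])
print(votes_linearly_independent(votes))  # True: rank 3 equals ensemble size
```

Adding a classifier whose votes are a linear combination of existing rows (e.g. an exact duplicate) leaves the rank unchanged, which is the sense in which such a member adds no representational capacity.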
Multi-Label Transfer Learning in Non-Stationary Data Streams
Du, Honghui, Minku, Leandro, Lawlor, Aonghus, Zhou, Huiyu
Abstract--Label concepts in multi-label data streams often experience drift in non-stationary environments, either independently or in relation to other labels. Transferring knowledge between related labels can accelerate adaptation, yet research on multi-label transfer learning for data streams remains limited. To address this, we propose two novel transfer learning methods: BR-MARLENE leverages knowledge from different labels in both source and target streams for multi-label classification; BRPW-MARLENE builds on this by explicitly modelling and transferring pairwise label dependencies to enhance learning performance. Comprehensive experiments show that both methods outperform state-of-the-art multi-label stream approaches in non-stationary environments, demonstrating the effectiveness of inter-label knowledge transfer for improved predictive performance.

Index Terms--Concept drift, non-stationary environment, multi-source, multi-label, class imbalance, transfer learning.

Most research on data stream learning concentrates on streams with single labels [1]. However, many practical data streaming applications naturally adopt a multi-label paradigm, where each incoming data example has more than one label [2]. For example, a social media post could be tagged with several descriptors, or a movie might be classified under various predefined genres (e.g., Action, Crime, Historical), with each tag or genre representing a unique label.
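The "BR" prefix in BR-MARLENE presumably denotes a binary-relevance style of decomposition, in which the multi-label problem is split into one binary problem per label. A generic streaming sketch of that decomposition (a plain online perceptron per label, not the authors' method):

```python
import numpy as np

# Illustrative binary-relevance (BR) decomposition for a multi-label stream:
# one tiny online perceptron per label, updated as examples arrive.
# This is a generic sketch, not BR-MARLENE itself.

class StreamingBR:
    def __init__(self, n_labels: int, n_features: int, lr: float = 0.1):
        self.w = np.zeros((n_labels, n_features))
        self.b = np.zeros(n_labels)
        self.lr = lr

    def predict(self, x: np.ndarray) -> np.ndarray:
        """Return a 0/1 vector with one prediction per label."""
        return (self.w @ x + self.b > 0).astype(int)

    def update(self, x: np.ndarray, y: np.ndarray) -> None:
        """Perceptron update per label; y is the 0/1 ground-truth label vector."""
        err = y - self.predict(x)              # -1, 0, or +1 per label
        self.w += self.lr * err[:, None] * x[None, :]
        self.b += self.lr * err
```

Each label's learner adapts independently; the point of the paper's transfer mechanisms is precisely to go beyond this independence and share knowledge across labels and streams.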
CCD: Continual Consistency Diffusion for Lifelong Generative Modeling
Liu, Jingren, Xu, Shuning, Wang, Yun, Ji, Zhong, Chen, Xiangyu
While diffusion-based models have shown remarkable generative capabilities in static settings, their extension to continual learning (CL) scenarios remains fundamentally constrained by Generative Catastrophic Forgetting (GCF). We observe that even with a rehearsal buffer, new generative skills often overwrite previous ones, degrading performance on earlier tasks. Although some initial efforts have explored this space, most rely on heuristics borrowed from continual classification methods or use trained diffusion models as ad hoc replay generators, lacking a principled, unified solution to mitigating GCF and often conducting experiments under fragmented and inconsistent settings. To address this gap, we introduce the Continual Diffusion Generation (CDG), a structured pipeline that redefines how diffusion models are implemented under CL and enables systematic evaluation of GCF. Beyond the empirical pipeline, we propose the first theoretical foundation for CDG, grounded in a cross-task analysis of diffusion-specific generative dynamics. Our theoretical investigation identifies three fundamental consistency principles essential for preserving knowledge in the rehearsal buffer over time: inter-task knowledge consistency, unconditional knowledge consistency, and prior knowledge consistency. These criteria expose the latent mechanisms through which generative forgetting manifests across sequential tasks. Motivated by these insights, we further propose \textit{Continual Consistency Diffusion} (CCD), a principled training framework that enforces these consistency objectives via hierarchical loss functions: $\mathcal{L}_{IKC}$, $\mathcal{L}_{UKC}$, and $\mathcal{L}_{PKC}$. Extensive experiments show that CCD achieves SOTA performance across various benchmarks, especially improving generative metrics in overlapping-task scenarios.
Language and Knowledge Representation: A Stratified Approach
It can have serious implications in critical application scenarios like that of Knowledge Graph-based multilingual data integration. In view of the above, the thesis argues that the current understanding of the problem of semantic heterogeneity as the 'existence of variance', while being crucially necessary, is not sufficient and under-characterized. There can be no variance without a prior notion of a unifying reference taken as the basis for computing the variance itself. To that end, the thesis proposes the problem of representation heterogeneity to emphasize the fact that heterogeneity is an intrinsic property of any representation, wherein different observers encode different representations of the same target reality in a stratified manner using different concepts, language and knowledge (as well as data). The thesis then advances a top-down solution approach to the above stratified problem of representation heterogeneity in terms of several solution components, namely: (i) a representation formalism stratified into concept level, language level, knowledge level and data level to accommodate representation heterogeneity; (ii) a top-down language representation using the Universal Knowledge Core (UKC), UKC namespaces and domain languages to tackle the conceptual and language level heterogeneity; (iii) a top-down knowledge representation using the notions of language teleontology and knowledge teleontology to tackle the knowledge level heterogeneity; (iv) the usage and further development of the existing LiveKnowledge catalog for enforcing iterative reuse and sharing of language and knowledge representations; and (v) the kTelos methodology integrating the solution components above to iteratively generate the language and knowledge representations, resolving representation heterogeneity.
The thesis also includes proof-of-concepts of the language and knowledge representations developed for two international research projects - DataScientia (data catalogs) and JIDEP (materials modelling). Finally, the thesis concludes with future lines of research.
GASP: Unifying Geometric and Semantic Self-Supervised Pre-training for Autonomous Driving
Ljungbergh, William, Lilja, Adam, Tonderski, Adam, Laveno Ling, Arvid, Lindström, Carl, Verbeke, Willem, Fu, Junsheng, Petersson, Christoffer, Hammarstrand, Lars, Felsberg, Michael
Self-supervised pre-training based on next-token prediction has enabled large language models to capture the underlying structure of text, and has led to unprecedented performance on a large array of tasks when applied at scale. Similarly, autonomous driving generates vast amounts of spatiotemporal data, alluding to the possibility of harnessing scale to learn the underlying geometric and semantic structure of the environment and its evolution over time. In this direction, we propose a geometric and semantic self-supervised pre-training method, GASP, that learns a unified representation by predicting, at any queried future point in spacetime, (1) general occupancy, capturing the evolving structure of the 3D scene; (2) ego occupancy, modeling the ego vehicle path through the environment; and (3) distilled high-level features from a vision foundation model. By modeling geometric and semantic 4D occupancy fields instead of raw sensor measurements, the model learns a structured, generalizable representation of the environment and its evolution through time. We validate GASP on multiple autonomous driving benchmarks, demonstrating significant improvements in semantic occupancy forecasting, online mapping, and ego trajectory prediction. Our results demonstrate that continuous 4D geometric and semantic occupancy prediction provides a scalable and effective pre-training paradigm for autonomous driving. For code and additional visualizations, see https://research.zenseact.com/publications/gasp/.
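The central modeling choice, predicting properties at "any queried future point in spacetime", amounts to learning a continuous field over (x, y, z, t). A toy sketch with an untrained random MLP standing in for the learned network (all sizes and names invented, not GASP's architecture):

```python
import numpy as np

# Toy continuous 4D occupancy field: maps any spacetime query point
# (x, y, z, t) to an occupancy probability in [0, 1]. The random weights
# stand in for a trained network; this only illustrates the interface.

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(32, 4)), np.zeros(32)   # hidden layer, invented size
W2, b2 = rng.normal(size=(1, 32)), np.zeros(1)    # scalar occupancy head

def occupancy(point_4d) -> float:
    """Query occupancy probability at a continuous spacetime point."""
    h = np.tanh(W1 @ np.asarray(point_4d, dtype=float) + b1)
    logit = (W2 @ h + b2)[0]
    return 1.0 / (1.0 + np.exp(-logit))
```

Because the field is continuous in its inputs, it can be queried at arbitrary future times rather than on a fixed grid of sensor frames, which is what distinguishes this interface from predicting raw measurements.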
On-device edge learning for IoT data streams: a survey
Lourenço, Afonso, Rodrigo, João, Gama, João, Marreiros, Goreti
In today's interconnected world, nearly every electronic device is transmitting data over the internet, whether intentionally or not. The Internet of Things (IoT) continues to evolve, enabling the optimization of processes across a wide range of domains [144]. While initially only servers had the necessary computing power for advanced analytics, as technology evolved, smaller devices gained enough computing power for some applications, eliminating network delays in areas where critical decisions must be made in an instant. This shift in data generation and utilization gives rise to two key paradigms: ubiquitous computing, which refers to the pervasive presence of processing power throughout our environments, making them more interconnected and intelligent; and edge computing, which emphasizes the location of data processing by moving computation closer to the data source, reducing reliance on centralized cloud infrastructures. In particular, due to the widespread adoption of relational databases in these domains, tabular data is the dominant modality in IoT applications. Organized into rows and columns, consisting of distinct features that are typically continuous, categorical, or ordinal, the data arrives continuously as an infinite stream.
A Survey of Text Classification Under Class Distribution Shift
Costache, Adriana Valentina, Gheorghe, Silviu Florin, Poesina, Eduard Gabriel, Irofti, Paul, Ionescu, Radu Tudor
The basic underlying assumption of machine learning (ML) models is that the training and test data are sampled from the same distribution. However, in daily practice, this assumption is often broken, i.e., the distribution of the test data changes over time, which hinders the application of conventional ML models. One domain where the distribution shift naturally occurs is text classification, since people always find new topics to discuss. To this end, we survey research articles studying open-set text classification and related tasks. We divide the methods in this area based on the constraints that define the kind of distribution shift and the corresponding problem formulation, i.e., learning with the Universum, zero-shot learning, and open-set learning. We next discuss the predominant mitigation approaches for each problem setup. Finally, we identify several future work directions, aiming to push the boundaries beyond the state of the art. Interestingly, we find that continual learning can solve many of the issues caused by the shifting class distribution. We maintain a list of relevant papers at https://github.com/Eduard6421/Open-Set-Survey.
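As an illustration of the open-set setting the survey covers, one widely used baseline rejects inputs the classifier is unsure about by thresholding the maximum softmax probability. This is a generic baseline, not a method proposed by the survey:

```python
import numpy as np

# Open-set baseline: accept the argmax class only when the maximum softmax
# probability clears a confidence threshold; otherwise flag as "unknown".

def softmax(logits: np.ndarray) -> np.ndarray:
    z = logits - logits.max()          # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def open_set_predict(logits, threshold: float = 0.7, unknown: int = -1) -> int:
    """Return the argmax class, or `unknown` when confidence is below threshold."""
    probs = softmax(np.asarray(logits, dtype=float))
    return int(np.argmax(probs)) if probs.max() >= threshold else unknown

print(open_set_predict([4.0, 0.0, 0.0]))  # confident -> class 0
print(open_set_predict([0.2, 0.1, 0.0]))  # ambiguous -> -1 (unknown)
```

The threshold trades off closed-set accuracy against the rate of correctly rejected novel-class inputs; methods in the survey differ chiefly in how they replace this crude confidence score.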
On the Structural Memory of LLM Agents
Zeng, Ruihong, Fang, Jinyuan, Liu, Siwei, Meng, Zaiqiao
Memory plays a pivotal role in enabling large language model (LLM)-based agents to engage in complex and long-term interactions, such as question answering (QA) and dialogue systems. While various memory modules have been proposed for these tasks, the impact of different memory structures across tasks remains insufficiently explored. This paper investigates how memory structures and memory retrieval methods affect the performance of LLM-based agents. Specifically, we evaluate four types of memory structures, including chunks, knowledge triples, atomic facts, and summaries, along with mixed memory that combines these components. In addition, we evaluate three widely used memory retrieval methods: single-step retrieval, reranking, and iterative retrieval. Extensive experiments conducted across four tasks and six datasets yield the following key insights: (1) Different memory structures offer distinct advantages, enabling them to be tailored to specific tasks; (2) Mixed memory structures demonstrate remarkable resilience in noisy environments; (3) Iterative retrieval consistently outperforms other methods across various scenarios. Our investigation aims to inspire further research into the design of memory systems for LLM-based agents.
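The three retrieval styles compared in the abstract can be sketched with a toy word-overlap scorer (the scorer and memory store are stand-ins, not the paper's system):

```python
# Toy versions of the three memory-retrieval styles the abstract compares.

def score(query: str, memory: str) -> int:
    """Toy relevance score: number of shared lowercase words."""
    return len(set(query.lower().split()) & set(memory.lower().split()))

def single_step(query, memories, k=2):
    """Single-step retrieval: score once, return the top-k memories."""
    return sorted(memories, key=lambda m: score(query, m), reverse=True)[:k]

def rerank(query, memories, k=2, pool=4):
    """Reranking: retrieve a larger pool, then rescore it (length as tiebreak)."""
    pool_hits = single_step(query, memories, k=pool)
    return sorted(pool_hits, key=lambda m: (score(query, m), -len(m)), reverse=True)[:k]

def iterative(query, memories, rounds=2, k=2):
    """Iterative retrieval: expand the query with retrieved memories and retrieve again."""
    q, hits = query, []
    for _ in range(rounds):
        hits = single_step(q, memories, k=k)
        q = query + " " + " ".join(hits)
    return hits
```

The iterative variant can surface memories that share no words with the original query but overlap with earlier hits, which mirrors why the paper finds it the strongest of the three.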
Crowdsourcing Lexical Diversity
Khalilia, Hadi, Otterbacher, Jahna, Bella, Gabor, Noortyani, Rusma, Darma, Shandy, Giunchiglia, Fausto
Lexical-semantic resources (LSRs), such as online lexicons or wordnets, are fundamental for natural language processing applications. In many languages, however, such resources suffer from quality issues: incorrect entries, incompleteness, but also, the rarely addressed issue of bias towards the English language and Anglo-Saxon culture. Such bias manifests itself in the absence of concepts specific to the language or culture at hand, the presence of foreign (Anglo-Saxon) concepts, as well as in the lack of an explicit indication of untranslatability, also known as cross-lingual \emph{lexical gaps}, when a term has no equivalent in another language. This paper proposes a novel crowdsourcing methodology for reducing bias in LSRs. Crowd workers compare lexemes from two languages, focusing on domains rich in lexical diversity, such as kinship or food. Our LingoGap crowdsourcing tool facilitates comparisons through microtasks identifying equivalent terms, language-specific terms, and lexical gaps across languages. We validated our method by applying it to two case studies focused on food-related terminology: (1) English and Arabic, and (2) Standard Indonesian and Banjarese. These experiments identified 2,140 lexical gaps in the first case study and 951 in the second. The success of these experiments confirmed the usability of our method and tool for future large-scale lexicon enrichment tasks.
SlimSeiz: Efficient Channel-Adaptive Seizure Prediction Using a Mamba-Enhanced Network
Lu, Guorui, Peng, Jing, Huang, Bingyuan, Gao, Chang, Stefanov, Todor, Hao, Yong, Chen, Qinyu
Epileptic seizures cause abnormal brain activity, and their unpredictability can lead to accidents, underscoring the need for long-term seizure prediction. Although seizures can be predicted by analyzing electroencephalogram (EEG) signals, existing methods often require too many electrode channels or larger models, limiting mobile usability. This paper introduces the SlimSeiz framework, which combines adaptive channel selection with a lightweight neural network model. SlimSeiz operates in two stages: the first stage selects the optimal channel set for seizure prediction using machine learning algorithms, and the second stage employs a lightweight neural network based on convolution and Mamba for prediction. On the Children's Hospital Boston-MIT (CHB-MIT) EEG dataset, SlimSeiz can reduce channels from 22 to 8 while achieving a satisfactory result of 94.8% accuracy, 95.5% sensitivity, and 94.0% specificity with only 21.2K model parameters, matching or outperforming larger models' performance. We also validate SlimSeiz on a new EEG dataset, SRH-LEI, collected from Shanghai Renji Hospital, demonstrating its effectiveness across different patients. The code and SRH-LEI dataset are available at https://github.com/guoruilu/SlimSeiz.
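The first SlimSeiz stage, selecting a small channel subset, can be sketched as scoring channels and keeping the top-k; the per-channel variance score below is a simple placeholder for the machine-learning-based selection the paper actually uses:

```python
import numpy as np

# Placeholder sketch of channel selection: rank EEG channels by a per-channel
# score (here, signal variance) and keep the k highest-scoring ones. The real
# SlimSeiz stage uses machine learning algorithms for this step.

def select_channels(eeg: np.ndarray, k: int = 8) -> np.ndarray:
    """eeg: (n_channels, n_samples) array; returns indices of the top-k channels."""
    scores = eeg.var(axis=1)               # placeholder per-channel score
    return np.argsort(scores)[::-1][:k]    # indices of the k highest scores
```

Reducing 22 channels to 8 this way shrinks both electrode count and model input size, which is what makes the second-stage network small enough for mobile use.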